Abstract

We establish a generic theoretical tool to construct probabilistic bounds 
for algorithms where the output is a subset of objects from an initial pool 
of candidates (or more generally, a probability distribution on said pool).
This general device, dubbed "Occam's hammer", acts as a meta layer 
when a probabilistic bound is already known on the objects of the pool 
taken individually, and aims at controlling the proportion of the objects in 
the set output not satisfying their individual bound. In this regard, it can 
be seen as a non-trivial generalization of the "union bound with a prior"
("Occam's razor"), a familiar tool in learning theory. We give applications
of this principle to randomized classifiers (providing an interesting
alternative approach to PAC-Bayes bounds) and multiple testing (where
it allows us to retrieve exactly, and extend, the so-called Benjamini-Yekutieli
testing procedure).

1 Introduction 



In this paper, we establish a generic theoretical tool for constructing
probabilistic bounds for algorithms which take as input some (random) data and
return as output a set $A$ of objects among a pool $\mathcal{H}$ of candidates (instead
of a single object $h \in \mathcal{H}$ in the classical setting). Here the "objects" could be,
for example, classifiers, functions, or hypotheses, according to the setting. One
wishes to predict that each object $h$ in the output set $A$ satisfies a property
$R(h, \alpha)$ (where $\alpha$ is an adjustable level parameter); the purpose of the
probabilistic bound is to guarantee that the proportion of objects in $A$ for which the
prediction is false does not exceed a certain value, and this with a prescribed
statistical confidence $1 - \delta$. Our setting also covers the more general case where
the algorithm returns a (data-dependent) probability density over $\mathcal{H}$.

Such a wide scope can appear dubious in its generality at first and even seem
to border on abstract nonsense, so let us try to explain right away what is
the nature of our result, and pinpoint a particular example to fix ideas. The
reason we encompass such a general framework is that our result acts as a 'meta'
layer: we will assume that we already have at hand a probabilistic bound for single,
fixed elements $h \in \mathcal{H}$. Assuming the reader is acquainted with classical learning
theory, let us consider the familiar example where $\mathcal{H}$ is a set of classifiers and
we observe an i.i.d. labeled sample of training data as an input. For each
fixed classifier $h \in \mathcal{H}$, we can predict with success probability at least $1 - \delta$ the
property $R(h, \delta)$ that the generalization error of $h$ is bounded by the training
error up to a quantity $\varepsilon(\delta)$, for example using the Chernoff bound. In the
classical setting, a learning method will return a single classifier $h \in \mathcal{H}$. If
nothing is known about the algorithm, we have to resort to worst-case analysis,
that is, obtain a uniform bound over $\mathcal{H}$; or in other terms, ensure that the
probability that the predicted properties hold for all $h \in \mathcal{H}$ is at least $1 - \delta$.
The simplest way to achieve this is to apply the union bound, combined with
a prior $\pi$ on $\mathcal{H}$ (assumed to be countable in this situation) prescribing how
to distribute the failure probability $\delta$ over $\mathcal{H}$. In the folklore, this is generally
referred to as Occam's razor bound, because the quantity $-\log(\pi(h))$, which can
be interpreted as a coding length for objects $h \in \mathcal{H}$, appears in some explicit
forms of the bound.

The goal of the present work is to put forward what can be seen as an
analogue of the above "union bound with a prior" for the set output (or probability
output) case, which we call Occam's hammer by remote analogy with the
principle underlying Occam's razor bound. Occam's hammer relies on two priors:
a complexity prior similar to the razor's (except it can be continuous) and a
second prior over the output set size or inverse output density. We believe that
Occam's hammer is not as immediately straightforward as the classical union
bound, and hope to show that it has potential for interesting applications. For
reasons of space, we will cut to the chase and first present Occam's hammer
in an abstract setting in the next section (the reader should keep in mind the
classifiers example to have a concrete instance at hand), then proceed to some
applications and a discussion about tightness. A natural application field is
multiple testing, where we want to accept or reject (in the classical statistical
sense) hypotheses from a pool $\mathcal{H}$; this will be developed in Section 3.2. The
present work was motivated by the PASCAL theoretical challenge on this
topic.

2 Main result



2.1 Setting 

Assume we have a pool of objects which is a measurable space $(\mathcal{H}, \mathfrak{H})$ and observe
a random variable $X$ (which can possibly represent an entire data sample) from
a probability space $(\mathcal{X}, \mathfrak{X}, P)$. Our basic assumption is:

Assumption A: for every $h \in \mathcal{H}$ and $\delta \in [0,1]$, we have at hand a set
$B(h, \delta) \in \mathfrak{X}$ such that $\mathbb{P}_{X \sim P}[X \in B(h, \delta)] \le \delta$. We call $B(h, \delta)$ the "bad event at
level $\delta$ for $h$". Moreover, we assume that the function
$(x, h, \delta) \in \mathcal{X} \times \mathcal{H} \times [0,1] \mapsto \mathbf{1}\{x \in B(h, \delta)\}$
is jointly measurable in its three variables. Finally, we assume
that for any $h \in \mathcal{H}$ we have $B(h, 0) = \emptyset$.

It should be understood that "bad events" represent regions where a
certain desired property does not hold, such as the true error being larger than
the empirical error plus $\varepsilon(\delta)$ in the classification case. Note that this
'desirable property' implicitly depends on the assigned confidence level $1 - \delta$. We
should keep in mind that as $\delta$ decreases, the set of observations satisfying the
corresponding property grows larger, but the property itself loses significance
(as is clear once again in the generalization error bound example). Of course,
the 'properties' corresponding to $\delta = 0$ or $1$ will generally be trivial ones, i.e.
$B(h, 0) = \emptyset$ and $B(h, 1) = \mathcal{X}$. Let us reformulate the union bound in this setting:

Proposition 1 (Abstract Occam's razor). Let $\pi$ be a prior probability
distribution on $\mathcal{H}$ and assume (A) holds. Then

$$\mathbb{P}_{X \sim P}\left[\exists h \in \mathcal{H},\ X \in B(h, \delta\pi(\{h\}))\right] \le \delta.$$

In particular, for any algorithm taking $X$ as an input and returning $h_X \in \mathcal{H}$ as
an output (in a measurable way as a function of $X$), we have

$$\mathbb{P}_{X \sim P}\left[X \in B(h_X, \delta\pi(\{h_X\}))\right] \le \delta.$$

Proof. In the first inequality we want to bound the probability of the event

$$\bigcup_{h \in \mathcal{H}} B(h, \delta\pi(\{h\})).$$

Since we assumed $B(h, 0) = \emptyset$, the above union can be reduced to a countable
union over the set $\{h \in \mathcal{H} : \pi(\{h\}) > 0\}$. It is in particular measurable. Then,
we apply the union bound over the sets in this union. The event in the second
inequality can be written as

$$\bigcup_{h \in \mathcal{H}} \left(\{X : h_X = h\} \cap B(h, \delta\pi(\{h\}))\right).$$

It is measurable by the same argument as above, and a subset of the first
considered event. □

Note that Occam's razor is obviously only interesting for atomic priors, and 
therefore essentially only useful for a countable object space H. 
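
As a concrete instance of Proposition 1, the following minimal sketch computes a razor-type bound for a small, hypothetical pool of classifiers. It assumes Hoeffding's inequality as the individual bound (swapped in here for the Chernoff bound mentioned in the introduction, because it has a closed form); the function `razor_bound` and the dictionaries `prior` and `train` are illustrative names, not part of the original text.

```python
import math

def razor_bound(train_err, prior_mass, n, delta):
    """Upper bound on the true error of one classifier h.

    The individual level is delta * pi({h}); by the union bound, all
    returned bounds hold simultaneously with probability >= 1 - delta.
    Assumption: Hoeffding's inequality is used as the per-classifier bound.
    """
    if prior_mass <= 0.0:
        return 1.0  # level 0: the bad event is empty, only the trivial bound holds
    level = delta * prior_mass
    eps = math.sqrt(math.log(1.0 / level) / (2.0 * n))
    return min(1.0, train_err + eps)

# Hypothetical pool of four classifiers, prior masses summing to one.
prior = {"h1": 0.5, "h2": 0.25, "h3": 0.125, "h4": 0.125}
train = {"h1": 0.10, "h2": 0.08, "h3": 0.12, "h4": 0.05}
bounds = {h: razor_bound(train[h], prior[h], n=1000, delta=0.05) for h in prior}
```

The term $\log(1/\pi(\{h\}))$ inside the square root is the "coding length" penalty: classifiers with smaller prior mass receive a looser bound.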
2.2 False prediction rate



Let us now assume that we have an algorithm taking $X$ as an input and
returning as an output a subset $A_X \subset \mathcal{H}$; we assume the function
$(X, h) \in \mathcal{X} \times \mathcal{H} \mapsto \mathbf{1}\{h \in A_X\}$ is bimeasurable. What we are interested in is upper bounding the
proportion of objects in $A_X$ falling in a "bad event". Here the word
'proportion' refers to a volume ratio, where volumes are measured through a reference
measure $\mu$ on $(\mathcal{H}, \mathfrak{H})$. Like in Occam's razor, we want to allow the set level to
depend on $h$ and possibly on $A_X$. Here is a formal definition for this:

Definition 1 (False prediction rate). Suppose assumption (A) holds. Let a function
$\Delta : \mathcal{H} \times \mathbb{R}_+ \to [0,1]$, jointly measurable in its two parameters, be fixed, called
the level function. Let $\mu$ be a volume measure on $\mathcal{H}$; we adopt the notation
$|S| = \mu(S)$ for $S \in \mathfrak{H}$. We define the false prediction rate for level function $\Delta$
as

$$\rho_\Delta(X, A) = \frac{\left|A \cap \{h \in \mathcal{H} : X \in B(h, \Delta(h, |A|))\}\right|}{|A|}, \quad \text{if } |A| \in (0, \infty);$$

and $\rho_\Delta(X, A) = 0$ if $|A| = 0$ or $|A| = \infty$.

The name false prediction rate was chosen by reference to the notion of false
discovery rate (FDR) in the multitesting framework (see more details below in
Section 3.2). We will drop the index $\Delta$ to lighten notation when there is no
ambiguity from the context. The pointwise false prediction rate for a specific
algorithm $X \mapsto A_X$ is therefore $\rho(X, A_X)$. In what follows, we will actually
upper bound the expected value $\mathbb{E}_X[\rho(X, A_X)]$ over the drawing of $X$. In some
cases, controlling the averaged FPR is a goal in its own right. Furthermore,
if we have a bound on $\mathbb{E}_X[\rho]$, then we can straightforwardly apply Markov's
inequality to obtain a confidence bound over $\rho$:

$$\mathbb{E}_X[\rho(X, A_X)] \le \gamma \ \Rightarrow\ \rho(X, A_X) \le \gamma\delta^{-1} \text{ with probability at least } 1 - \delta.$$
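
For a finite pool with the counting measure as volume, Definition 1 and the Markov step above reduce to a few lines. The sketch below is only an illustration under that assumption; `false_prediction_rate`, `markov_confidence_bound` and the callable `is_bad` (which should return whether $X \in B(h, \Delta(h, |A|))$ for the observed data) are hypothetical names.

```python
def false_prediction_rate(selected, is_bad):
    """FPR under the counting measure.

    selected : list of objects output by the algorithm (the set A)
    is_bad   : callable (h, size) -> bool, True when the observed data fall
               in the bad event B(h, Delta(h, size))
    """
    size = len(selected)
    if size == 0:
        return 0.0  # convention of Definition 1 when |A| = 0
    return sum(1 for h in selected if is_bad(h, size)) / size

def markov_confidence_bound(expected_fpr_bound, delta):
    """If E[rho] <= gamma, then rho <= gamma / delta with probability >= 1 - delta."""
    return min(1.0, expected_fpr_bound / delta)

# Hypothetical output set in which one of four predictions is false.
selected = ["a", "b", "c", "d"]
failed = {"b"}
fpr = false_prediction_rate(selected, lambda h, size: h in failed)
```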



2.3 Warming up: algorithm with constant volume output 

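To begin with, let us consider the easier case where the set output given by the
algorithm has a fixed size, i.e. $|A_X| = a$ is a constant instead of being random.

Proposition 2. Suppose assumption (A) holds and that
$(X, h) \in \mathcal{X} \times \mathcal{H} \mapsto \mathbf{1}\{h \in A_X\}$ is bimeasurable. Assume $|A_X| = \mu(A_X) = a$ a.s. Let $\pi$ be a
probability density function on $\mathcal{H}$ with respect to the measure $\mu$. Then putting
$\Delta(h, |A|) = \min(\delta a \pi(h), 1)$, it holds that

$$\mathbb{E}_{X \sim P}[\rho(X, A_X)] \le \delta.$$

Proof: Obviously, $\Delta$ is bimeasurable. We then have

$$\begin{aligned}
\mathbb{E}_{X \sim P}[\rho(X, A_X)] &= \mathbb{E}_{X \sim P}\left[a^{-1}\left|A_X \cap \{h \in \mathcal{H} : X \in B(h, \Delta(h, |A_X|))\}\right|\right] \\
&\le \mathbb{E}_{X \sim P}\left[\left|\{h \in \mathcal{H} : X \in B(h, \min(\delta a \pi(h), 1))\}\right|\right] a^{-1} \\
&= \int_h \mathbb{P}_{X \sim P}\left[B(h, \min(\delta a \pi(h), 1))\right] d\mu(h)\, a^{-1} \\
&\le \delta \int_h \pi(h)\, d\mu(h) = \delta,
\end{aligned}$$

where the last inequality uses assumption (A) on the probability of bad events. □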
As a sanity check, consider a countable set $\mathcal{H}$ with $\mu$ the counting measure,
and an algorithm returning only singletons, $A_X = \{h_X\}$, so that $|A_X| = 1$.
Then in this case $\rho \in \{0, 1\}$, and with the above choice of $\Delta$, we get
$\rho(X, \{h\}) = \mathbf{1}\{X \in B(h, \delta\pi(h))\}$. Therefore,
$\mathbb{E}_X[\rho(X, A_X)] = \mathbb{P}_X[X \in B(h_X, \delta\pi(h_X))] \le \delta$,
i.e., we have recovered Occam's razor.

2.4 General case 

The previous section might let us hope that $\Delta(h, |A|) = \delta|A|\pi(h)$ would be a
suitable level function in the more general situation where the size $|A_X|$ is also
variable; but things get more involved. The observant reader might have noticed
that, in Proposition 2, the weaker assumption $|A_X| \ge a$ a.s. is actually sufficient.
This therefore suggests the following strategy to deal with variable size of $A_X$: (1)
consider a discretization of sizes through a decreasing sequence $(a_k)$ converging
to zero, and a prior $\gamma$ on the elements of the sequence; (2) apply Proposition 2
for all $k$ with $(a_k, \gamma(a_k)\delta)$ in place of $(a, \delta)$; (3) define $\Delta(h, |A|) = \delta\pi(h)a_k\gamma(a_k)$
whenever $|A| \in [a_k, a_{k-1})$; then by summation over $k$ (or, to put it differently,
the union bound) it holds that $\mathbb{E}[\rho] \le \delta$ for this choice of $\Delta$.

This is a valid approach, but we will not enter into more details concerning
it; rather, we propose what we consider to be an improved and more elegant
result below, which will additionally allow us to handle the more general case where
the algorithm returns a probability distribution over $\mathcal{H}$ instead of just a subset.
However, we will require a slight strengthening of assumption (A):

Assumption A': like assumption (A), but we additionally require that for
any $h \in \mathcal{H}$, $B(h, \delta)$ is a nondecreasing family of sets as a function of $\delta$, i.e.,
$B(h, \delta) \subset B(h, \delta')$ for $\delta \le \delta'$.

The assumption of nondecreasing bad events as a function of their probability
seems quite natural and is satisfied in the applications we have in mind; in
classification for example, bounds on the true error are nonincreasing in the
parameter $\delta$ (so the set of samples where the bound is violated is nondecreasing).
We now state our main result (proof found in the Appendix):

Theorem 1 (Occam's hammer). Suppose assumption (A') is satisfied. Let:

(i) $\mu$ be a nonnegative reference measure on $\mathcal{H}$ (the volumic measure);

(ii) $\pi$ be a probability density function with respect to $\mu$ (the complexity
prior);

(iii) $\gamma$ be a probability distribution on $(0, +\infty)$ (the inverse density prior).
Put $\beta(x) = \int_0^x u\, d\gamma(u)$ for $x \in (0, +\infty)$. Define the level function

$$\Delta(h, \theta) = \min(\delta\pi(h)\beta(\theta^{-1}), 1).$$

Then for any algorithm $X \mapsto \theta_X$ returning a probability density $\theta_X$ over $\mathcal{H}$ with
respect to $\mu$, and such that $(X, h) \mapsto \theta_X(h)$ is bimeasurable, it holds that

$$\mathbb{P}_{X \sim P,\ h \sim \theta_X \cdot \mu}\left[X \in B(h, \Delta(h, \theta_X(h)))\right] \le \delta.$$

Comments: an algorithm returning a probability density over
$\mathcal{H}$ is more general than an algorithm returning a set, as the latter case can be
cast into the former by considering a constant density over the set,
$\theta_A(h) = |A|^{-1}\mathbf{1}\{h \in A\}$. This specialization gives a maybe more intuitive interpretation
of the inverse density prior $\gamma$, which then actually becomes a prior on the
volume of the set output. We can thus recover the case of constant set volume $a$ of
Proposition 2 by using the above specialization and taking a Dirac distribution
for the inverse density prior, $\gamma = \delta_a$. In particular, Occam's razor is a
specialization of Occam's hammer (up to the minor strengthening in assumption
(A')).

To compare with the "naive" strategy described earlier based on a size
discretization sequence $(a_k)$, we get the following advantages: Occam's hammer
also works with the more general case of a probability output; it avoids any
discretization of the prior; finally, even if we take the discrete prior $\gamma = \sum_k \gamma_k \delta_{a_k}$ in
Occam's hammer, the level function for $|A| \in [a_k, a_{k-1})$ will be proportional to
the partial sum $\sum_{j \ge k} \gamma_j a_j$ (taken over all atoms $a_j \le |A|$, since the sequence is
decreasing) instead of only the term $\gamma_k a_k$ in the naive approach
(remember that the higher the level function, the better, since the corresponding
'desirable property' is more significant for higher levels).
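
The comparison can be made concrete for a discrete inverse density prior. The following sketch, for a set output of volume $|A|$ (so that $\theta^{-1} = |A|$), evaluates the hammer level $\min(\delta\pi(h)\beta(|A|), 1)$ next to the level of the discretization strategy; the function names and the numerical prior are hypothetical, and the `min(., 1)` cap in `naive_level` is added only to keep the level in $[0, 1]$.

```python
def beta(x, atoms, weights):
    """beta(x) = integral over (0, x] of u dgamma(u), for gamma = sum_k weights[k] * Dirac(atoms[k])."""
    return sum(a * w for a, w in zip(atoms, weights) if a <= x)

def hammer_level(delta, prior_density, volume, atoms, weights):
    """Level function of Theorem 1 for a set output: theta = 1/|A|, so beta(1/theta) = beta(|A|)."""
    return min(delta * prior_density * beta(volume, atoms, weights), 1.0)

def naive_level(delta, prior_density, volume, atoms, weights):
    """Level of the discretization strategy: only the largest atom a_k <= |A| contributes."""
    eligible = [(a, w) for a, w in zip(atoms, weights) if a <= volume]
    if not eligible:
        return 0.0
    a_k, w_k = max(eligible)
    return min(delta * prior_density * a_k * w_k, 1.0)

# Hypothetical size prior on three atoms (decreasing sequence) and a set of volume 0.6.
atoms = [1.0, 0.5, 0.25]
weights = [0.5, 0.3, 0.2]
level_hammer = hammer_level(0.1, 1.0, 0.6, atoms, weights)
level_naive = naive_level(0.1, 1.0, 0.6, atoms, weights)
```

With a single atom (a Dirac prior at $a$) and $|A| = a$, `hammer_level` returns $\min(\delta a \pi(h), 1)$, the level of Proposition 2.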

3 Applications 

3.1 Randomized classifiers: an alternate look at PAC- 
Bayes bounds 

Our first application is concerned with our running example, classifiers. More
precisely, assume the observed variable is actually an i.i.d. sample $S = (X_i, Y_i)_{i=1}^n$,
and $\mathcal{H}$ is a set of classifiers. Let $\mathcal{E}(h)$, resp. $\widehat{\mathcal{E}}(h, S)$, denote the generalization,
resp. training, error. We will consider a randomized classification algorithm,
consisting in selecting a probability density function $\theta_S$ on $\mathcal{H}$ based on the
sample, then drawing a classifier at random from $\mathcal{H}$ using the distribution $\theta_S \cdot \mu$,
where $\mu$ is here assumed to be a reference probability measure. For example, we
could return the uniform density on the set of classifiers $A_S \subset \mathcal{H}$ having their
empirical error less than a (possibly data-dependent) threshold. We obtain the
following result:

Proposition 3. Let $\mu$ be a probability measure over $\mathcal{H}$; for any algorithm
$S \mapsto \theta_S$ returning a probability density $\theta_S$ over $\mathcal{H}$ (wrt. $\mu$), if $h_S$ is a randomized
classifier drawn according to $\theta_S \cdot \mu$, the following inequality holds with probability
$1 - \delta$ over the draw of $S$ and $h_S$:

$$D_+\left(\widehat{\mathcal{E}}(h_S, S)\,\|\,\mathcal{E}(h_S)\right) \le \frac{\log(n\delta^{-1})}{n} + \frac{\log_+(\theta_S(h_S))}{n - 1},$$

where $\log_+$ is the positive part of the logarithm; and
$D_+(q\|p) = q\log\frac{q}{p} + (1 - q)\log\frac{1 - q}{1 - p}$ if $q < p$ and $0$ otherwise.

Proof. Define the bad events
$B(h, \delta) = \left\{S : D_+(\widehat{\mathcal{E}}(h, S)\|\mathcal{E}(h)) > \frac{\log \delta^{-1}}{n}\right\}$,
satisfying assumption (A') by Chernoff's bound (see, e.g., [6]); choose $\pi \equiv 1$
and $\gamma$ the probability distribution on $[0, 1]$ having density $\frac{1}{n-1}x^{-1+\frac{1}{n-1}}$, so that
$\beta(x) = \frac{1}{n}\min(x^{\frac{n}{n-1}}, 1)$, and apply Occam's hammer. □
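
To turn the inequality of Proposition 3 into a numerical bound on the true error of the drawn classifier, one can invert $D_+$ in its second argument. The sketch below does this by bisection; it is an illustration only, and the function names as well as the example values (sample size, density value, training error) are hypothetical.

```python
import math

def kl_plus(q, p):
    """D_+(q || p): binary KL divergence if q < p, and 0 otherwise."""
    if q >= p:
        return 0.0
    if p >= 1.0:
        return math.inf
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def hammer_rhs(n, delta, density):
    """Right-hand side of Proposition 3: log(n/delta)/n + log_+(theta_S(h_S))/(n-1)."""
    log_plus = max(0.0, math.log(density)) if density > 0.0 else 0.0
    return math.log(n / delta) / n + log_plus / (n - 1)

def error_upper_bound(train_err, rhs, tol=1e-10):
    """Largest p in [train_err, 1] (up to tol) with D_+(train_err || p) <= rhs."""
    lo, hi = train_err, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_plus(train_err, mid) <= rhs:
            lo = mid  # mid is still compatible with the bound
        else:
            hi = mid
    return lo

# Hypothetical example: n = 1000, delta = 0.05, the drawn classifier has
# density 20 with respect to mu and training error 0.1.
rhs = hammer_rhs(1000, 0.05, 20.0)
bound = error_upper_bound(0.1, rhs)
```

The bisection is valid because $p \mapsto D_+(q\|p)$ is nondecreasing on $[q, 1]$; a more concentrated output density (larger $\theta_S(h_S)$) yields a larger right-hand side and hence a looser bound.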

Comparison with PAC-Bayes bounds. The by now quite well-established
PAC-Bayes bounds ([5], see also [?] and references therein, and [?] for recent
developments) deal with a similar setting of randomized classifiers. In these
bounds typically appears a complexity term of the form $D(\theta_S\|\pi)$, $D$ denoting
the KL divergence. If we forget about the positive part, the expectation of the
second term in the above bound with respect to the drawing of $h$ is, up to the
factor $1/(n-1)$, precisely $D(\theta_S\|\pi)$. We actually deliberately picked priors and bad events in the above
proposition in order to obtain a result that is formally as close as possible to
a tight expression of the PAC-Bayes bound given in [5], Theorem 5.1. The
similarity is striking, so that a discussion is in order.

• PAC-Bayes bounds are generally concerned with bounding the average
error $\mathbb{E}_{h \sim \theta_S \cdot \mu}[\mathcal{E}(h)]$ of the randomized procedure. Occam's hammer, on the
other hand, bounds directly the true error of the randomized output. In other
words, Proposition 3 appears (almost) as a pointwise version of [5], Theorem 5.1;
this is an essential difference. Pointwise results using the PAC-Bayes approach
have also appeared in recent work; it is not entirely clear to us, however,
whether the methodology developed there is precise enough to recover a pointwise
version of [5], Theorem 5.1. The point of the present discussion is that, while
these different bounds have an identical behavior from an asymptotic point of view,
it is important for practice to have bounds that are as sharp as possible at finite
horizon. We believe the Occam's hammer approach could be particularly useful
in this regard, and plan to make an extensive comparison on simulations in
future work.

• Technically, PAC-Bayes bounds more or less rely on two main ingredients:
(1) the entropy extremal inequality $\mathbb{E}_P[X] \le \log\mathbb{E}_Q[e^X] + D(P\|Q)$ and (2)
inequalities on the Laplace transform of i.i.d. sums. Occam's hammer is, in a
sense, less sophisticated since it only relies on simple set measure manipulations
and contains no exponential moment inequality argument. On the other hand,
it acts as a 'meta' layer into which any other bound family can be plugged.
These could be inequalities based on the Laplace transform (Chernoff method),
or not: in the above example, we could have plugged in the binomial tail
inversion bound (which is the most accurate deterministic bound possible for
estimating a Bernoulli parameter). In classical PAC-Bayes, there is no such clear
separation between the bound and the randomization; they are intertwined in
the analysis.

We hope this short discussion is enough to convince the reader that Occam's hammer
and PAC-Bayes bounds, although closely related, are of a somewhat different
nature. Apparently one does not subsume the other, although we certainly
believe that the relation between the two should be explored more thoroughly
in future work.



3.2 Multiple testing: a family of "step-up" algorithms 
with distribution-free FDR control 

We now change gears and switch to the context of multiple testing. $\mathcal{H}$ is now
a set of null hypotheses concerning the distribution $P$. In this section we will
assume for simplicity that $\mathcal{H}$ is finite and the volume measure $\mu$ is the counting
measure, although this could obviously be extended. The goal is, based on
observed data, to discover a subset of hypotheses which are predicted to be
false (or "rejected"). To have an example in mind, think of microarray data,
where we observe a small number of i.i.d. repetitions of a variable in very high
dimension $d$ (the total number of genes), corresponding to the expression level
of said genes, and we want to find a set of genes having average expression level
bigger than some fixed threshold $t$. In this case, there is one null hypothesis $h$
per gene, namely that the average expression level for this gene is lower than $t$.

We assume that we already have at hand a family of tests $T(X, h, \alpha)$ of
level $\alpha$ for each individual $h$. That is, $T(X, h, \alpha)$ is a function taking values
in $\{0, 1\}$ (the value 1 corresponds to "null hypothesis rejected") such that for
all $h \in \mathcal{H}$, for all distributions $P$ such that $h$ is true, $\mathbb{P}_{X \sim P}[T(X, h, \alpha) = 1] \le \alpha$.
To apply Occam's hammer, we suppose that the family $T(X, h, \alpha)$ is increasing,
i.e. $\alpha \ge \alpha' \Rightarrow T(X, h, \alpha) \ge T(X, h, \alpha')$. This is generally satisfied, as typically tests
have the form $T(X, h, \alpha) = \mathbf{1}\{F(h, X) \ge \phi(\alpha)\}$, where $F$ is some test statistic
and $\phi(\alpha)$ is a nonincreasing threshold function (as, for example, in a one-sided
T-test).

For a fixed, but unknown, data distribution $P$, let us define

$$\mathcal{H}_0 = \{h \in \mathcal{H} : P \text{ satisfies hypothesis } h\}$$

the set of true null hypotheses, and $\mathcal{H}_1 = \mathcal{H} \setminus \mathcal{H}_0$ its complement. An
important and relatively recent concept in multiple testing is that of false discovery
rate (FDR), introduced by Benjamini and Hochberg. Let $A : X \mapsto A_X \subset \mathcal{H}$ be a procedure returning
a set of rejected hypotheses based on the data. The FDR of such a procedure
is defined as

$$\mathrm{FDR}(A) = \mathbb{E}_{X \sim P}\left[\frac{|A_X \cap \mathcal{H}_0|}{|A_X|}\right],$$

with the ratio taken to be 0 when $A_X$ is empty (as in Definition 1).

Note that, in contrast to our notion of FPR introduced in Section 2.2, the FDR
is already an averaged quantity. A desirable goal is to design testing procedures
where it can be ensured that the FDR is controlled by some fixed level $\alpha$. The
rationale behind this is that, in practice, one can afford that a small proportion
of rejected hypotheses are actually true. Before this notion was introduced, in
most cases one would instead bound the probability that at least one hypothesis
was falsely rejected: this is typically achieved using the (uniform) union bound,
known as "Bonferroni's correction" in the multitesting literature. The hope is
that, by allowing a little more slack in the acceptable error by controlling only
the FDR, one obtains less conservative testing procedures as a counterpart. We
refer the reader to [?] for a more extended discussion on these issues.

Let us now describe how Occam's hammer can be put to use here. Let $\pi$ be a
probability distribution over $\mathcal{H}$, $\gamma$ be a probability distribution over the integer
interval $[1 \ldots |\mathcal{H}|]$, and $\beta(k) = \sum_{i \le k} i\gamma(i)$. Define the procedure returning the
following set of hypotheses:

$$A : X \mapsto A_X = \bigcup\left\{G \subset \mathcal{H} : \forall h \in G,\ T(X, h, \alpha\pi(h)\beta(|G|)) = 1\right\}. \qquad (1)$$

(This type of procedure is called "step-up" and can be implemented through a
simple water-emptying type algorithm; see also the discussion below.) We have
the following property:

Proposition 4. The set of hypotheses returned by the procedure defined by (1)
has its false discovery rate bounded by $\pi(\mathcal{H}_0)\alpha \le \alpha$.

Proof. Define the collection of "bad events" $B(h, \delta) = \{X : T(X, h, \delta) = 1\}$ if
$h \in \mathcal{H}_0$, and $B(h, \delta) = \emptyset$ otherwise. It is an increasing family by the assumption
on the test family. Obviously, for any $G \subset \mathcal{H}$, and any level function $\Delta$:

$$G \cap \{h \in \mathcal{H} : X \in B(h, \Delta(h, |G|))\} = G \cap \mathcal{H}_0 \cap \{h \in \mathcal{H} : T(X, h, \Delta(h, |G|)) = 1\};$$

therefore, if $G \subset \{h \in \mathcal{H} : T(X, h, \Delta(h, |G|)) = 1\}$, it holds that

$$\left|G \cap \{h \in \mathcal{H} : X \in B(h, \Delta(h, |G|))\}\right| = |G \cap \mathcal{H}_0|.$$

Since $A_X$ satisfies the above condition, the averaged FPR for level function $\Delta$
coincides with the FDR. Define the modified prior $\tilde{\pi}(h) = \mathbf{1}\{h \in \mathcal{H}_0\}\pi(\mathcal{H}_0)^{-1}\pi(h)$.
Apply Occam's hammer with priors $\mu$, $\tilde{\pi}$, $\gamma$ and $\delta = \pi(\mathcal{H}_0)\alpha$ to finish the
proof. □

Interestingly, the above result specialized to the case where $\pi$ is uniform
on $\mathcal{H}$ and $\gamma(i) = \kappa^{-1}i^{-1}$, with $\kappa = \sum_{i \le |\mathcal{H}|} i^{-1}$, results in $\beta(i) = \kappa^{-1}i$, and yields
exactly what is known as the Benjamini-Yekutieli (BY) step-up procedure [?].
Unfortunately, the interest of the BY procedure is mainly theoretical, because
the more popular Benjamini-Hochberg (BH) step-up procedure [3] is generally
preferred in practice. The BH procedure is in all points similar to BY, except
the above constant $\kappa$ is replaced by 1. The BH procedure was shown to result
in controlled FDR at level $\alpha$ if the test statistics are independent or positively
correlated [?]. In contrast, the BY procedure is distribution-free. Practitioners
usually favor the less conservative BH, although the underlying statistical
assumption is disputable. For example, in the interesting case of microarray data
analysis, it is reported that the amplification of genes during the process can
be very unequal as genes "compete" for the amount of polymerase available.
A few RNA strands can "take over" early in the RT-PCR process, and, due
to the exponential reaction, can leave other strands unamplified because of a
lack of polymerase later in the process. Such an effect creates strong statistical
dependencies between individual gene amplifications, in particular negative
correlations in the observed expression levels.
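
The step-up procedure (1) can be sketched in a few lines under the additional assumption, made here only for illustration, that each individual test is based on a p-value, i.e. $T(X, h, a) = \mathbf{1}\{p_h \le a\}$. The names `step_up` and `by_size_prior` are hypothetical. The code looks for the largest size $k$ such that at least $k$ hypotheses pass at level $\alpha\pi(h)\beta(k)$; since $\beta$ is nondecreasing, the set of passing hypotheses for that $k$ is the union in (1).

```python
def step_up(p_values, alpha, gamma, prior=None):
    """Indices of the hypotheses rejected by the step-up procedure (1).

    p_values : one p-value per null hypothesis (assumed p-value based tests)
    gamma    : gamma[i - 1] is the prior mass of output size i, i = 1..m
    prior    : complexity prior pi over hypotheses (uniform if None)
    """
    m = len(p_values)
    if prior is None:
        prior = [1.0 / m] * m
    # beta(k) = sum_{i <= k} i * gamma(i), stored in beta[k - 1]
    beta, running = [], 0.0
    for i, g in enumerate(gamma, start=1):
        running += i * g
        beta.append(running)
    # Scan sizes from the largest down; stop at the first k with >= k passing tests.
    for k in range(m, 0, -1):
        passing = [h for h in range(m)
                   if p_values[h] <= alpha * prior[h] * beta[k - 1]]
        if len(passing) >= k:
            return passing
    return []

def by_size_prior(m):
    """Size prior gamma(i) = 1 / (kappa * i), which gives beta(k) = k / kappa."""
    kappa = sum(1.0 / i for i in range(1, m + 1))
    return [1.0 / (kappa * i) for i in range(1, m + 1)]

# Hypothetical p-values for five null hypotheses.
p_values = [0.001, 0.008, 0.04, 0.2, 0.9]
rejected = step_up(p_values, alpha=0.1, gamma=by_size_prior(5))
```

With the uniform complexity prior and `by_size_prior`, the thresholds are $\alpha k / (\kappa |\mathcal{H}|)$, the form described above for the BY procedure; other choices of `gamma` and `prior` give other members of the family.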

This discussion aside, we think there are several interesting added benefits in
retrieving the BY procedure via Occam's hammer. First, in our opinion Occam's
hammer sheds a totally new light on this kind of multi-testing procedure, as the
proof method followed in [?] was different and very specific to the framework
and properties of statistical testing. Secondly, Occam's hammer allows us to
generalize this procedure straightforwardly to an entire family by playing with
the prior $\pi$ and, more importantly, the size prior $\gamma$. In particular, it is clear
that if something is known a priori about the expected size of the output, then
this should be taken into account in the size prior $\gamma$, possibly leading to a
more powerful testing procedure. Further, there is a significant hope that we
can improve the accuracy of the procedure by considering priors depending on
unknown quantities, but which can be suitably approximated in view of the
data, thereby following the general principle of "self-bounding" algorithms that
has proved to be quite powerful ([7], see also [?] where this idea is used as
well under a different form, called "localization"). This is certainly an exciting
direction for future developments.



4 Tightness — the sharp edge of the hammer 

It is of interest to know whether Occam's hammer is accurate in the sense
that the bound can be achieved in some (worst case) situations. A simple
argument is that Occam's hammer is a generalization of Occam's razor: since
the razor is sharp [?], so is the hammer. This is somewhat unsatisfying since
it ignores the situation Occam's hammer was designed for. In this section,
we address this point by imposing an (almost) arbitrary inverse density prior
$\nu$ and exhibiting an example where the bound is tight. Furthermore, in order
to represent a "realistic" situation, we want the "bad sets" $B(h, \alpha)$ to be of
the form $\{X_h \ge t(h, \alpha)\}$ where $X_h$ is a certain real random variable associated
to $h$. This is consistent with situations of interest described above (confidence
intervals and hypothesis testing). We have the following result:

Proposition 5. Let $\mathcal{H} = [0, 1]$ with interval extremities identified (i.e. the unit
circumference circle). Let $\nu$ be a probability distribution on $[0, 1]$, and $\alpha_0 \in [0, 1]$
be given. Put $\beta(x) = \int_0^x u\, d\nu(u)$. Assume that $\beta$ is a continuous, increasing
function. Then there exists a family of real random variables $(X_h)_{h \in \mathcal{H}}$, having
identical marginal distributions $P$, and a random subset $A \subset [0, 1]$ such that, if
$t(\alpha)$ is the upper $\alpha$-quantile of $P$ (i.e., $P(X \ge t(\alpha)) = \alpha$), then

$$\mathbb{E}\left[\frac{\left|\{h \in A \text{ and } X_h \ge t(\alpha_0\beta(|A|))\}\right|}{|A|}\right] = \alpha_0.$$

Furthermore, $P$ can be made equal to any arbitrary distribution without atoms.

Comments. In the proposed construction (see the proof in the appendix), the
FPR is a.s. equal to $\alpha_0$, and the marginal distribution of $|A|$ is precisely $\nu$. This
example shows that Occam's hammer can be sharp for the type of situation it
was crafted for (set output procedures), but is not entirely satisfying for two
reasons. The first one is that the way $A$ is constructed is somewhat artificial:
it would be more convincing if $A$ was selected by some criterion based purely
on the observed data $(X_h)$. A more problematic point is that in the above
construction, we are basically observing a single sample of $(X_h)$, while in most
interesting applications we have statistics based on averages of i.i.d. samples.
If we could construct an example in which $(X_h)$ is a Gaussian process, it would
be fine, since observing an i.i.d. sample and taking the average would amount
to a variance rescaling of the original process. In the above, although we can
choose each $X_h$ to have a marginal Gaussian distribution, the whole family
is unfortunately not jointly Gaussian (inspecting the proof, it appears that
for $h \ne h'$ there is a nonzero probability that $X_h = X_{h'}$, as well as $X_h \ne
X_{h'}$, so that $(X_h, X_{h'})$ cannot be jointly Gaussian). Finding a good sharpness
example using a Gaussian process (e.g. using some suitable modification of the
Brownian bridge process, maybe having the same covariance structure as the
above construction) is an interesting open problem.

5 Conclusion 

We hope to have shown convincingly that Occam's hammer is a powerful and 
versatile theoretical device. It allows an alternate, and perhaps unexpected, 
approach to PAC-Bayes type bounds, as well as to multiple testing procedures. 
The fact that we retrieve exactly the BY distribution-free multitesting proce- 
dure and extend it to a whole family shows that Occam's hammer has a strong 
potential for producing practically useful bounds and procedures. In particular, 
a very interesting direction for future research is to include in the priors knowledge
about the typical behavior of the output set size. At any rate, a significant
achievement of Occam's hammer is to provide a first strong bridge between the worlds
of learning theory and multiple hypothesis testing.

Finally, we want to underline once again that, like Occam's razor, Occam's 
hammer is a meta device that can apply on top of other bounds. This feature 
is particularly nice and leads us to expect that this tool will prove to have 
meaningful uses for other applications. 

6 Appendix — proofs 

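Proof of Theorem 1. The proof of Occam's hammer is in essence an
integration by parts argument, where the "parts" are level sets over $\mathcal{X} \times \mathcal{H}$ of the
output density $\theta_X(h)$. We have

$$\begin{aligned}
&\mathbb{P}_{X \sim P,\ h \sim \theta_X \cdot \mu}\left[X \in B(h, \Delta(h, \theta_X(h)))\right] \\
&\quad= \mathbb{E}_{X \sim P}\,\mathbb{E}_{h \sim \theta_X \cdot \mu}\left[\mathbf{1}\{X \in B(h, \Delta(h, \theta_X(h)))\}\right] \\
&\quad= \int_{(X,h)} \mathbf{1}\{X \in B(h, \Delta(h, \theta_X(h)))\}\,\theta_X(h)\, d\mu(h)\, dP(X) \\
&\quad= \int_{(X,h)} \mathbf{1}\{X \in B(h, \Delta(h, \theta_X(h)))\} \int_{y>0} y^{-2}\,\mathbf{1}\{y \ge \theta_X(h)^{-1}\}\, dy\, dP(X)\, d\mu(h) \\
&\quad= \int_{y>0} y^{-2} \int_{(X,h)} \mathbf{1}\{X \in B(h, \Delta(h, \theta_X(h)))\}\,\mathbf{1}\{\theta_X(h) \ge y^{-1}\}\, dP(X)\, d\mu(h)\, dy \\
&\quad\le \int_{y>0} y^{-2} \int_{(X,h)} \mathbf{1}\{X \in B(h, \Delta(h, y^{-1}))\}\, dP(X)\, d\mu(h)\, dy \\
&\quad= \int_{y>0} y^{-2} \int_h \mathbb{P}_{X \sim P}\left[B(h, \min(\delta\pi(h)\beta(y), 1))\right] d\mu(h)\, dy \\
&\quad\le \int_{y>0} y^{-2}\,\delta\beta(y) \int_h \pi(h)\, d\mu(h)\, dy \\
&\quad= \delta \int_{y>0}\int_{u>0} \mathbf{1}\{u \le y\}\, y^{-2} u\, dy\, d\gamma(u) = \delta \int_{u>0} d\gamma(u) = \delta.
\end{aligned}$$

For the first inequality, we have used assumption (A') that $B(h, \delta)$ is an increasing
family and the fact that $\Delta(h, \theta)$ is a nonincreasing function in $\theta$ (since $\beta$ is a
nondecreasing function). In the second inequality we have used the assumption
on the probability of bad events. □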
Proof of Proposition 5. Let $\nu$ and $\alpha_0$ be fixed. We will construct explicitly
the family $(X_h)_{h \in \mathcal{H}}$. First, let us denote $Q$ the image probability distribution
on $[0, \alpha_0]$ of $\nu$ by the linear rescaling $x \mapsto \alpha_0 x$. Now, let $x$ be a random variable
uniformly distributed in $[0, 1]$ and $u$ an independent variable with distribution
$Q$. We now define the family $(X_h)$ given $(x, u)$ the following way:

$$X_h = \begin{cases} G(u) & \text{if } h \in [x, x + u], \\ Y & \text{otherwise,} \end{cases}$$

where $G(u)$ is a decreasing real function $[0, 1] \to [T, +\infty)$ (decreasing, as required
for the quantile formula below to define a quantile that is nonincreasing in $\alpha$), and $Y$ is a random
variable independent of $(x, u)$, with values in $(-\infty, T]$. We will show that it
is possible to choose $G$, $Y$, $T$ to satisfy the claim of the proposition. In the above
construction, remember that since we are working on the circle, the interval
$[x, x + u]$ should be "wrapped around" if $x + u > 1$.

First, let us compute explicitly the quantile $t(\alpha)$ of $X_h$ for $\alpha < \alpha_0$. We have
assumed that $Y \le T$ a.s., so that for any $h \in \mathcal{H}$, $t > T$,

$$\mathbb{P}[X_h \ge t] = \mathbb{E}_u\left[\mathbb{P}[X_h \ge t \mid u]\right] = \mathbb{E}_u\left[\mathbb{P}[G(u) \ge t;\ h \in [x, x + u] \mid u]\right] = \mathbb{E}_u\left[u\,\mathbf{1}\{G(u) \ge t\}\right] = \alpha_0\beta(\alpha_0^{-1}G^{-1}(t)).$$

Setting the above quantity equal to $\alpha$ entails that $t(\alpha) = G(\alpha_0\beta^{-1}(\alpha_0^{-1}\alpha))$.
Now, let us choose $A = [x, x + \alpha_0^{-1}u]$. Then $|A| = \alpha_0^{-1}u$, hence

$$t(\alpha_0\beta(|A|)) = G(\alpha_0\beta^{-1}(\alpha_0^{-1}\alpha_0\beta(\alpha_0^{-1}u))) = G(u).$$

This entails that we have precisely $A \cap \{h : X_h \ge t(\alpha_0\beta(|A|))\} = [x, x + u]$,
so that $\left|\{h \in A \text{ and } X_h \ge t(\alpha_0\beta(|A|))\}\right|\,|A|^{-1} = \alpha_0$ a.s. Finally, if we want a
prescribed marginal distribution $P$ for $X_h$, we can take $T$ as the upper
$\alpha_0$-quantile of $P$, $Y$ a variable with distribution the conditional of $P$ given
$X \le T$, and, since $\beta$ is continuous increasing, we can choose $G$ so that $t(\alpha)$
matches the upper quantiles of $P$ for $\alpha < \alpha_0$. □
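
The marginal tail computation in this proof can be checked numerically on one particular instance. The sketch below is a Monte Carlo illustration under assumptions that are not in the text: $\nu$ uniform on $[0, 1]$ (so $\beta(x) = x^2/2$), $G(u) = 1/u$ (a decreasing choice, with $T = 1/\alpha_0$), and a fixed point $h$ on the circle. It compares the empirical frequency of $\{X_h \ge t\}$ for $t > T$ with $\alpha_0\beta(\alpha_0^{-1}G^{-1}(t))$; the function names are hypothetical.

```python
import random

def simulate_marginal_tail(alpha0, t, h=0.3, trials=200_000, seed=0):
    """Monte Carlo estimate of P[X_h >= t] for t > T = 1/alpha0.

    Assumed instance: nu uniform on [0, 1], G(u) = 1/u.
    For t > T the event {X_h >= t} can only occur through G(u), since Y <= T.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.random()               # uniform position on the circle
        u = alpha0 * rng.random()      # u ~ Q, image of nu under v -> alpha0 * v
        in_arc = ((h - x) % 1.0) <= u  # h in [x, x + u], wrapped around the circle
        if in_arc and u > 0.0 and 1.0 / u >= t:
            hits += 1
    return hits / trials

def predicted_tail(alpha0, t):
    """alpha0 * beta(G^{-1}(t) / alpha0) with beta(x) = x^2 / 2 and G^{-1}(t) = 1/t."""
    return alpha0 * 0.5 * (1.0 / (t * alpha0)) ** 2

estimate = simulate_marginal_tail(alpha0=0.5, t=4.0)
prediction = predicted_tail(alpha0=0.5, t=4.0)
```

For the set $A = [x, x + \alpha_0^{-1}u]$ the ratio $|[x, x + u]| / |A|$ equals $\alpha_0$ by construction, so only the marginal computation is simulated here.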